Gossip-based Failure Detection and Consensus for Terascale Computing

Author

  • Rajagopal Subramaniyan
Abstract

Thesis presented to the Graduate School of the University of Florida in partial fulfillment of the requirements for the degree of Master of Science, May 2003. Chair: Alan D. George. Department: Electrical and Computer Engineering.

One promising avenue of research on failure detection for large systems is the use of gossiping and other epidemic techniques to disseminate availability information. Gossip protocols and services provide a way to detect failures in large, distributed systems in an asynchronous manner without the limitations associated with reliable multicasting for group communications. However, the performance of these failure-detection protocols, in terms of metrics like correctness, completeness and scalability, needs to be improved for use in terascale clusters. The consensus algorithm used to detect failures in a system-wide fashion should be resilient enough to detect both individual node failures and group failures. Gossiping with consensus can take place throughout the system via a flat structure, or it can be hierarchically distributed across cooperating layers of nodes. This thesis presents a scalable multilayered gossip protocol, with features to strengthen the completeness and correctness of the failure-detection service. We examine the performance of the layered gossip protocol in terms of scalability on an experimental...
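The flat form of gossip-based failure detection with consensus mentioned in the abstract can be illustrated with a small round-based simulation. This is only a sketch under stated assumptions, not the thesis's protocol: the function names `simulate` and `consensus`, the parameter `t_cleanup`, the push-only gossip to one random peer per round, and the rule that a node is declared failed only when every live node suspects it are all hypothetical simplifications chosen for illustration.

```python
import random


def simulate(n=8, failed=frozenset({3}), rounds=40, t_cleanup=10, seed=0):
    """Run a round-based gossip simulation (illustrative assumptions only).

    Returns a dict mapping each live node to the set of nodes it suspects.
    """
    rng = random.Random(seed)
    live = [i for i in range(n) if i not in failed]
    # heartbeat[i][j]: highest heartbeat counter node i has seen for node j
    heartbeat = [[0] * n for _ in range(n)]
    # last_update[i][j]: round in which node i last saw j's counter increase
    last_update = [[0] * n for _ in range(n)]

    for r in range(1, rounds + 1):
        # Nodes act one after another, so news may travel several hops per round.
        for i in live:
            heartbeat[i][i] += 1
            last_update[i][i] = r
            # Push the local table to one randomly chosen peer.
            peer = rng.choice([j for j in range(n) if j != i])
            if peer in failed:
                continue  # a message sent to a failed node is simply lost
            for j in range(n):
                if heartbeat[i][j] > heartbeat[peer][j]:
                    heartbeat[peer][j] = heartbeat[i][j]
                    last_update[peer][j] = r

    # Node i suspects j if j's counter has not increased for t_cleanup rounds.
    return {
        i: {j for j in range(n) if rounds - last_update[i][j] >= t_cleanup}
        for i in live
    }


def consensus(suspects):
    """Nodes suspected by every live node (a simplified agreement rule)."""
    return set.intersection(*suspects.values())


if __name__ == "__main__":
    print(sorted(consensus(simulate())))
```

In this sketch a failed node never increments its heartbeat, so every live node eventually suspects it and it appears in the agreed set. A small `t_cleanup` detects failures sooner but raises the chance that a slow-spreading heartbeat causes a live node to be suspected, which is the correctness trade-off the abstract alludes to.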

Similar resources

DisTriB: Distributed Trust Management Model Based on Gossip Learning and Bayesian Networks in Collaborative Computing Systems

The interactions among peers in Peer-to-Peer systems, as distributed collaborative systems, are based on asynchronous and unreliable communications. Trust is an essential and facilitating component in these interactions, especially in such uncertain environments. Various attacks that affect trust are possible due to the large-scale nature and openness of these systems. Peers do not have enough inform...

Experimental Evaluation of a Failure Detection Service Based on a Gossip Strategy

Failure detectors were first proposed as an abstraction that makes it possible to solve consensus in asynchronous systems. A failure detector is a distributed oracle that provides information about the state of processes of a distributed system. This work presents a failure detection service based on a gossip strategy. The service was implemented on the JXTA platform. A simulator was also imple...

Epidemic Failure Detection and Consensus for Extreme Parallelism

Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum’s User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault ...

A Failure Detection Service Based on Epidemic Dissemination for Peer-to-Peer Networks

Failure detectors were first proposed as an abstraction that makes it possible to solve consensus in asynchronous systems. A failure detector is a distributed oracle that provides information about the state of processes of a distributed system. This work presents a failure detection service based on a gossip strategy. The service was implemented on the JXTA platform. A simulator was also imple...

Publication date: 2002